STATISTICS AS A MATHEMATICAL DISCIPLINE


D.  V.  Lindley 


(This  paper  is  a revised  version  of  a lecture  given  to  a general 
mathematical  audience  at  the  First  Australasian  Mathematical  Convention 
held  in  Christchurch,  15-19  May  1978.  The  revision  consists  of  increas- 
ing the  interpretative  remarks  at  the  cost  of  more  mathematical  ones; 
the  judgment  being  that  these  will  be  of  more  interest  to  the  readers 
of  this  journal.  The  example  which  concluded  the  lecture  has  been 
replaced  by  a different  one,  though  emphasizing  the  same  point,  because 
the  problem  with  which  it  deals  has  been  discussed  recently  in  Australasia, 
in  particular  by  Davies  (1978)  at  the  conference.  The  example  originally 
given in the lecture has appeared in Lindley and Phillips (1976). I am
grateful  for  the  award  of  an  Erskine  Fellowship  by  the  University  of 
Canterbury  which  enabled  me  to  attend  the  convention  and  profit  from  a 
stay  on  the  campus.) 

1.  The  Axioms  of  Statistics. 

Mathematicians  who  study  statistics,  perhaps  through  having  to  give 
a course  on  it,  are  nearly  always  surprised  by  what  they  find.  (Actually 
"surprise"  is  perhaps  the  politest  term  used:  the  adjective  is  often 
derogatory.)  This  is  reasonable,  because  they  do  not  find  a discipline 
akin  to  what  they  are  used  to  in  other  branches  of  mathematics.  They  do 
not  find  a system  that  starts  from  fundamental  notions  incorporated  in 
axioms,  proceeding  through  definitions  to  theorems  expressing  important 
results in the field. Instead they find a collection of loosely related
techniques, such as confidence intervals and significance tests, and a
resulting confusion about what to do in any particular case. The purpose
of  this  paper  is  to  show  that  statistics  can  be  regarded  as  a formal, 
mathematical  system,  like  Euclidean  geometry,  so  that  mathematicians  need 
no  longer  be  "surprised". 

Of  course,  I am  not  advocating  this  formalism  just  to  keep  mathemati- 
cians happy.  That  is  a worthy  objective,  but  not  as  important  as  the 
consideration  that  the  mathematical  method  (if  I can  call  it  that,  without 
being  too  precise  about  what  is  meant)  works  extremely  well,  and  it  is 
reasonable  to  expect  advantages  from  applying  it  to  statistical  problems. 
With  that  method  we  know  exactly  what  is  true  and  what  is  false.  In  any 
problem we know what to do - in Newtonian mechanics write down the
equations of motion - and thought can be concentrated on how to do it. Indeed,
the main advantage of this approach to statistics is that it gives
practically useful results and is of just as much interest to the practitioner
of  statistics  as  to  the  mathematical  statistician.  We  shall  also  look  at 
an  exciting  theorem  that  is  of  great  practical  importance. 

We  begin,  like  Euclidean  geometry,  with  the  axioms.  There  is  naturally 
freedom  of  choice  here,  but  I shall  take  a system  used  by  DeGroot  (1970)  in 
his  excellent  text.  First  it  is  necessary  to  say  what  is  under  discussion: 
the  equivalent  of  the  "points"  and  "lines"  of  geometry.  In  statistics  we 
refer  to  events  and  we  try  to  capture  feelings  of  uncertainty  that  we  have 
about  events.  For  example,  the  event  that  a treatment  will  increase  the 
yield.  Statistical  inference  is  a way  of  handling  uncertainty.  The  events 
are denoted A, B, C, ... and they will be supposed to form a $\sigma$-field in a
space S. This restriction merely means that the events can always be
combined, to form a new event, in the usual ways. Our first axiom introduces
the relation "not more likely than" between events, and is written
$\precsim$.

A1: For any two events A, B; either $A \precsim B$, $B \precsim A$, or both.

This says that any two events can be compared. In the usual way we obtain
$A \sim B$, A and B are equally likely, and $A \prec B$, A is less likely than
B. The second axiom shows what happens to the relation when events are
combined in certain ways.

A2: If $A_1 A_2 = B_1 B_2 = \emptyset$ and $A_i \precsim B_i$ (i = 1, 2), then
$\cup A_i \precsim \cup B_i$. Furthermore $A_1 \prec B_1$ and $A_2 \precsim B_2$
imply $\cup A_i \prec \cup B_i$.

Here  two  events  are  each  broken  up  into  component  events:  if  the  compon- 
ents of  one  event  are  not  more  likely  than  the  components  of  the  other, 
then  the  same  relation  holds  between  the  original  events.  It  is  easy  to 
deduce that these two axioms imply transitivity, namely $A \precsim B$ and
$B \precsim C$ imply $A \precsim C$, and this may be taken as an axiom if preferred
though it is weaker than the present one.

Other  axioms  follow,  but  these  two  are  the  key  ones  and  it  is  worth 
making a few comments on them. First, the axioms are normative, not
descriptive: that is, they refer to a norm or standard of behavior that you
would  like  to  achieve  if  you  knew  how;  they  do  not  describe  your  abilities 
at  the  moment.  The  task  of  statistics  is  to  provide  a satisfactory  way  of 
measuring  uncertainty;  the  suggestion  is  that  such  measurements  would  obey 
the  two  requirements  so  far  set  out.  Without  statistics  you  may  not  be 
able  to  effect  the  required  comparisons  of  events.  Second,  the  axioms  can 
be given an operational test in terms of bets: $A \prec B$ if a bet to win
when  B is  true  is  preferred  to  a bet  to  win  the  same  amount  when  A is 
true.  This  operational  interpretation  is  important  in  order  that  the 
system  can  be  used.  Without  it  the  system  remains  pure  mathematics. 

The third axiom relates $\precsim$ to the concept of inclusion, or
implication, for events:

A3: $\emptyset \precsim A$ and $\emptyset \prec S$.

A consequence of this is that $A \subset B$ implies $A \precsim B$: if A implies
B, then A is not more likely than B.

The  fourth  axiom  is  a matter  of  some  contention.  It  is  introduced 
in  order  to  extend  the  concepts  of  the  first  three  axioms  from  finite 
collections  of  events  to  enumerable  collections.  It  is  possible  to 
dispense  with  it,  but  it  is  certainly  simpler  to  include  it. 

A4: If $A_1 \supset A_2 \supset \dots$ and $A_i \succsim B$ for all i, then $\bigcap A_i \succsim B$.

In words, if events get more and more restrictive but are never less
likely than B, then the total restriction is not less likely than B.
In technical language, this leads to $\sigma$-additivity.

Our object is to measure uncertainty; that is, to associate with
each  event  a number  describing  that  uncertainty.  It  had  been  conjectured 
that  these  four  axioms  would  be  enough  to  do  this,  but  a counterexample 
showed  that  this  was  not  so.  They  have  therefore  to  be  strengthened. 

There  are  various  ways  of  doing  this,  but  all  of  them  essentially  amount 
to  introducing  a standard  of  uncertainty  to  which  all  events  can  be 
referred.  Indeed,  in  any  measurement  process,  say  of  length  or  mass, 
measurement  is  always  with  respect  to  a standard;  so  it  is  by  no  means 
unusual  to  introduce  the  same  concept  in  connection  with  uncertainty. 

It  is  simplest  to  invoke  a strong  standard  that  brings  continuity  along 
with it. This we do by supposing the $\sigma$-field contains the Borel sets of
the  unit  interval  [0,  1]  of  the  real  line  and  presume: 

A5: There exists a uniform probability distribution in [0, 1].

This  is  our  standard.  By  a uniform  distribution  is  meant  an  assignment 
of uncertainty such that if $I_1$ and $I_2$ are any two intervals in [0, 1]
of equal length then they are judged equally likely, $I_1 \sim I_2$.

It  is  now  a straightforward  matter  to  use  a Dedekind-type  of  argument 
to  show  that  for  any  A,  there  exists  an  interval  I,  only  the  length  of 
which  matters,  such  that  A ~ I.  With  each  A is  associated  a number, 
equal  to  the  length  of  the  corresponding  I and  called  the  probability 
of  A.  Furthermore  it  can  be  shown  that  these  numbers  obey  the  usual 
laws  of  probability  discussed  below,  and  therefore  are  entitled  to  the 
name probability. These probabilities are written p(A|S) - read "the
probability  of  A given  S"  - since  they  depend  on  S as  well  as  the 
uncertain  event  under  consideration.  An  example  will  demonstrate  the 
truth of this latter point. Suppose S contains the (n+1) integers
(0, 1, 2, ..., n) corresponding to the number of heads in n tosses of
a coin: then your probability of A, exactly 2 heads, will be altered
if S is restricted to include only the even integers.
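
The coin example can be made concrete with a few lines of arithmetic. The sketch below is only an illustration: it assumes a fair coin with independent tosses and n = 4 (neither is specified in the text), and the function names are hypothetical. The conditional probability is computed as the usual ratio of the probability of the joint event to the probability of the conditioning event.

```python
from fractions import Fraction
from math import comb

def heads_distribution(n):
    """p(k | S) for k heads in n tosses, ASSUMING a fair coin and independent tosses."""
    return {k: Fraction(comb(n, k), 2**n) for k in range(n + 1)}

def conditional(dist, event, given):
    """Probability of `event` when the space is restricted to `given`."""
    p_given = sum(dist[k] for k in given)
    p_both = sum(dist[k] for k in event & given)
    return p_both / p_given

n = 4                                   # illustrative choice, not from the text
dist = heads_distribution(n)
S = set(range(n + 1))                   # the full space {0, 1, ..., n}
evens = {k for k in S if k % 2 == 0}    # S restricted to the even integers
A = {2}                                 # the event "exactly 2 heads"

p_A_given_S = conditional(dist, A, S)
p_A_given_evens = conditional(dist, A, evens)
```

Under these assumptions the probability of A is 3/8 given S and 3/4 given the even integers, so the number attached to A does depend on the space to which it is referred.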

Our  final  axiom  is  concerned  with  such  a restriction  of  S to  a 
subset C say, with $C \succ \emptyset$. We introduce a notion of A is not more
likely than B, given C and write $(A|C) \precsim (B|C)$, the original form
obtaining when C = S. These ordering relations are connected with the
original ordering by

A6: For any A, B and C, $C \succ \emptyset$, $(A|C) \precsim (B|C)$ iff $AC \precsim BC$.

The  motivation  behind  this  assumption  is  best  explained  in  terms  of  the 
operational  bets  referred  to  above.  Suppose  that  you  are  contemplating 
a bet  which  will  only  take  place  if  C occurs  and  then  gives  a reward 
only  if  A does,  and  are  comparing  it  with  a second  bet  under  the  same 
conditions  except  that  the  reward  hinges  on  B.  Then  the  rewards  will 
only  arise  if  A and  C,  or  B and  C,  both  occur.  Consequently  the 
relevant uncertainties concern AC and BC. The bets are "called-off"
if C does not occur, so this is sometimes referred to as the axiom of
called-off bets.

These  six  axioms  complete  the  system  and  from  them  it  is  possible 
to  prove  the  basic 

Theorem: A1 - A6 imply that there exists a unique probability
distribution p(A|B) for all A, and all $B \succ \emptyset$, such that
$(A|C) \precsim (B|C)$ iff $p(A|C) \le p(B|C)$.

By  a probability  distribution  is  meant  an  assignment  of  numbers, 
written p(A|B), to all events A, and all events B with p(B|S) > 0
such  that 
P1: p(A|A) = 1 and $0 \le p(A|B) \le 1$. (Convexity)

P2: If $\{A_i: i = 1, 2, \dots\}$ are exclusive, $A_i A_j = \emptyset$ for all unequal
i, j, then $p(\cup A_i | B) = \sum p(A_i | B)$. (Additivity)

P3: p(AB|C) = p(A|C) p(B|AC). (Multiplicativity)

(The  three  properties  are  often  described  by  the  names  given  in  brackets.) 
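
As a small sanity check, the three laws can be verified by brute force on a finite space. The sketch below is an illustration only: the four-point space and its weights are hypothetical, the function names are invented for the example, and additivity is checked only for pairs of exclusive events (the finite case, not the countable one).

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical four-point space with arbitrary positive weights summing to 1.
weights = {"a": Fraction(1, 2), "b": Fraction(1, 4),
           "c": Fraction(1, 8), "d": Fraction(1, 8)}
S = frozenset(weights)

def p(A, B=S):
    """p(A | B): weight of the intersection of A and B relative to the weight of B."""
    return sum(weights[s] for s in A & B) / sum(weights[s] for s in B)

def all_events(space):
    """Every subset of the space, including the empty event."""
    items = sorted(space)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def check_laws():
    """Return True if convexity, pairwise additivity and multiplicativity all hold."""
    events = all_events(S)
    for B in (e for e in events if e):          # conditioning events of positive weight
        if p(B, B) != 1:
            return False                        # convexity: p(B | B) = 1
        for A in events:
            if not 0 <= p(A, B) <= 1:
                return False                    # convexity: bounds
            for A2 in events:
                if not (A & A2) and p(A | A2, B) != p(A, B) + p(A2, B):
                    return False                # additivity for two exclusive events
                if A & B and p(A & A2, B) != p(A, B) * p(A2, A & B):
                    return False                # multiplicativity
    return True
```

Calling `check_laws()` on this space returns True; exact rational arithmetic is used so that the equalities are tested without rounding error.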

2.  The  Likelihood  Principle 

In  words,  the  theorem  says  that  an  appreciation  of  uncertainty  within 
the  framework  of  the  axioms  A1  - A6  is  only  possible  through  the  notion  of 
probability:  uncertainty  can  only  be  described  probabilistically.  To 
use  any  other  method  that  is  not  equivalent  to  a probability  calculus  will 
lead  to  violation  of  one  of  the  axioms.  Would  you  wish  to  do  that?  This 
is  a striking  result  since,  as  we  shall  see  below,  statisticians  do  use 
other  methods  and  get  into  difficulties  as  a result.  Since  the  proof  of 
the  theorem  is  simple,  the  only  objection  to  the  approach  can  be  in  the 
axioms:  are  they  satisfactory?  The  one  that  is  most  obviously  open  to 
attack  is  A4  with  its  possibly  infinite  set  of  events;  but  if  it  is  omitted 
P1 - P3 remain except that in P2 the events can only be finite in number
and the probability is finitely-, and not $\sigma$-, additive. This leads to
difficulties: for example, the marginalization paradoxes of Dawid, Stone
and Zidek (1973). Another reason for thinking that the system is
satisfactory is that it is possible to approach the topic of uncertainty in
other ways and still end up with P1 - P3 or variants thereof. We cite
the  work  of  DeFinetti  (1974).  More  modest  approaches  that  deal  with 
special  systems  lead  to  the  likelihood  principle  that  we  will  discuss 
below:  see,  for  example,  Birnbaum  (1962)  or  Basu  (1975).  Another  aspect 
of  this  approach  to  uncertainty  is  through  bets,  as  already  mentioned. 
It is possible to show that any measurement of uncertainty that can be
used as a basis for bets and is not probabilistic results in a Dutch
book: that is, a combination of bets that will lose you money for sure,
whatever happens. Dutch books can be made for most statistical practices.
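
The simplest instance of a Dutch book can be sketched in code. This is not the general result referred to above, only the elementary case of an event A and its complement; it assumes that the person quoting the numbers is willing to buy or sell, at the quoted price, a ticket paying 1 unit if the event occurs. The function name and the numerical quotes are hypothetical.

```python
from fractions import Fraction

def dutch_book(q_A, q_not_A):
    """Net gain to the quoter in each outcome when an opponent exploits the quotes.

    q_A and q_not_A are the prices at which the quoter will buy or sell a ticket
    paying 1 if A (respectively not-A) occurs. Returns None when the quotes add
    to 1, since this particular construction then yields no sure loss.
    """
    total = q_A + q_not_A
    if total == 1:
        return None
    # If the quotes add to more than 1, the opponent sells the quoter both tickets;
    # if they add to less than 1, the opponent buys both tickets from the quoter.
    side = 1 if total > 1 else -1
    gains = {}
    for outcome in ("A", "not A"):
        payout = 1                      # exactly one of the two tickets pays 1
        gains[outcome] = side * (payout - total)
    return gains

too_high = dutch_book(Fraction(3, 5), Fraction(3, 5))
too_low = dutch_book(Fraction(1, 4), Fraction(1, 2))
coherent = dutch_book(Fraction(1, 3), Fraction(2, 3))
```

In both incoherent cases the gain is negative whichever outcome occurs, which is the sure loss described in the text; quotes that add to 1 are immune to this particular combination of bets.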

The  argument  is  like  that  used  in  other  branches  of  mathematics. 
Furthermore  it  is  operational.  One  can  test  people  by  means  of  bets  to 
see  if  they  react  in  the  way  the  results  require:  it  is  good  applied, 
as  well  as  pure,  mathematics.  Furthermore  we  are  now  in  a position  to 
prove  other  theorems  and  can  test  these  against  practical  experience. 

One such theorem is Bayes theorem which says that whenever the
probabilities exist $p(B_i|AC) \propto p(A|B_iC)\, p(B_i|C)$ for i = 1, 2, ..., the
omitted constant of proportionality not depending on i. This theorem
plays  such  an  important  role  in  modern  statistics  that  methodology  based 
on  the  axioms  is  often  called  Bayesian  statistics.  It  is  better  to  refer 
to  the  axioms  as  those  of  coherence  because  they  typically  deal  with  how 
judgments  of  uncertainty  get  together,  or  cohere;  so  that  perhaps  the 
appropriate description is coherent statistics. But Bayes theorem does
have  an  astonishing  consequence  for  statistics  that  we  now  explore. 

In statistical problems S is a product space $X \times \Theta$ of elements
$(x, \theta)$. (For simplicity we will describe the situation where S is
enumerable. The results generalize to more general classes of spaces.)
The quantity x is referred to as the data and $\theta$ as the parameter.
The probabilities, given S, are most easily described by a density
$p(x, \theta)$, never negative and with $\sum_{x,\theta} p(x, \theta) = 1$. Then
$p(A|S) = \sum_{(x,\theta) \in A} p(x, \theta)$ and generally p(A|B) is p(AB|S)/p(B|S) by P3.
In particular $p(\theta) = \sum_x p(x, \theta)$ and $p(x|\theta) = p(x, \theta)/p(\theta)$
provided $p(\theta) \ne 0$, which will henceforth be supposed true. (Conventional
statistics would admit $p(x|\theta)$ but not $p(\theta)$.) The point of writing S
in this way is that the statistician observes the value of x, but not of
$\theta$, and wishes to express his uncertainty about $\theta$ in the light of the
observation. Bayes theorem allows him to calculate this as

$p(\theta|x) \propto p(x|\theta)\, p(\theta)$

the constant not involving $\theta$. (In fact it is $p(x)^{-1}$.)
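
For an enumerable space the calculation is a matter of multiplying and renormalizing. The following is a minimal sketch under stated assumptions: the two parameter values, the prior and the table of $p(x|\theta)$ are all hypothetical numbers chosen for illustration, and the function name is invented here.

```python
from fractions import Fraction

def posterior(prior, likelihood, x):
    """Return (p(theta | x) for each theta, p(x)) by Bayes theorem.

    prior[theta] is p(theta); likelihood[theta][x] is p(x | theta).
    """
    unnormalized = {t: likelihood[t][x] * prior[t] for t in prior}
    p_x = sum(unnormalized.values())            # the marginal p(x)
    return {t: u / p_x for t, u in unnormalized.items()}, p_x

# Hypothetical two-point parameter space and binary data.
prior = {"low": Fraction(1, 2), "high": Fraction(1, 2)}
likelihood = {
    "low":  {0: Fraction(3, 4), 1: Fraction(1, 4)},
    "high": {0: Fraction(1, 4), 1: Fraction(3, 4)},
}

post, p_x = posterior(prior, likelihood, x=1)
```

With these numbers p(x) = 1/2 and the posterior probabilities are 1/4 and 3/4; dividing by p(x) is exactly the omitted constant of proportionality.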

At this point we had better improve on our rather sloppy notation
in order to emphasize the nature of the last result. Bayes theorem would
better be written

$p_\theta(\cdot|x) \propto p_x(x|\cdot)\, p_\theta(\cdot)$

to emphasize the fact that x is known and that the functions are all
functions of $\theta$. The first and last of the three functions are clearly
probability densities, never negative and adding to 1
($\sum_\theta p_\theta(\theta|x) = \sum_\theta p_\theta(\theta) = 1$). The other, $p_x(x|\cdot)$, is nonnegative but does not
typically add to 1: it is not a probability (as a function of $\theta$) and is
termed the likelihood of $\theta$, given x. In words the above result is
described by saying that the probability of $\theta$ given x is proportional
to the product of the likelihood of $\theta$ given x and the probability
of $\theta$ prior to x. We have proved the following

Theorem. The uncertainty about $\theta$, given x, depends on x only
through the likelihood function $p_x(x|\cdot)$.

Expressed differently, if two data values, $x_1$ and $x_2$, have the
same likelihood, then the uncertainty about $\theta$, given $x_1$, is the same
as that given $x_2$. Or, in another form, the likelihood function is a
sufficient statistic. It is often referred to as the likelihood
principle.

This is a surprising result. To appreciate the reason for the
astonishment, recall that it has been proved on the basis of some
reasonable axioms about uncertainty, so should reasonably apply in
practical situations of uncertainty. In fact, almost all methods used
in statistics violate the principle. It is easy to see this since the
only probability admitted in sampling theory statistics is $p_x(\cdot|\theta)$ for
each $\theta$ and the procedures used involve integrals over x-values. For
example, a significance test has $\int_R p_x(x|\theta)\,dx = \alpha$ over the rejection
region R for values of $\theta$ constituting the null hypothesis; an unbiased
estimate t(x) employs $\int_X t(x)\, p_x(x|\theta)\,dx = \theta$ for all $\theta$. In both cases,
as in others, values of x besides that observed are used, so violating
the  likelihood  principle.  The  method  of  maximum  likelihood  is  an  obvious 
exception,  but  that  is  unsatisfactory  for  a different  reason  that  will 
appear  below. 
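
The conflict can be seen in miniature with a familiar coin-tossing illustration. The sketch below is not taken from this paper: the data (9 heads and 3 tails) are hypothetical, the parameter is restricted to a small grid with a uniform prior purely for convenience, and the significance test is taken to be one-sided for $\theta = 1/2$. Two sampling designs give proportional likelihoods, hence the same posterior, yet the tail-area calculations sum over different sets of unobserved x-values and disagree.

```python
from fractions import Fraction
from math import comb

HEADS, TAILS = 9, 3   # hypothetical data: 9 heads and 3 tails in 12 tosses

def binomial_likelihood(theta):
    """Design 1: the number of tosses (12) was fixed in advance."""
    n = HEADS + TAILS
    return comb(n, HEADS) * theta**HEADS * (1 - theta)**TAILS

def negative_binomial_likelihood(theta):
    """Design 2: tossing continued until the 3rd tail, which came on toss 12."""
    n = HEADS + TAILS
    return comb(n - 1, HEADS) * theta**HEADS * (1 - theta)**TAILS

def posterior(likelihood, grid):
    """Posterior over a grid of theta values under a uniform prior on the grid."""
    weights = [likelihood(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

def binomial_tail():
    """P(at least 9 heads in 12 tosses | theta = 1/2)."""
    n = HEADS + TAILS
    return sum(Fraction(comb(n, k), 2**n) for k in range(HEADS, n + 1))

def negative_binomial_tail():
    """P(at least 9 heads before the 3rd tail | theta = 1/2).

    Equivalent to at most 2 tails among the first 11 tosses.
    """
    m = HEADS + TAILS - 1
    return sum(Fraction(comb(m, k), 2**m) for k in range(TAILS))

grid = [Fraction(k, 10) for k in range(1, 10)]
same_posterior = (posterior(binomial_likelihood, grid)
                  == posterior(negative_binomial_likelihood, grid))
```

Here `same_posterior` is True, while the two tail areas are 299/4096 (about 0.073) and 67/2048 (about 0.033): the same observed tosses lead to different significance levels depending on which unobserved outcomes are integrated over.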

We  now  have  a wonderful  way  of  testing  the  ideas  developed  in  this 
paper.  According  to  Popper,  a theory  is  valuable  if  several  testable 
results  can  be  deduced  from  it.  And  the  theory  fails  as  soon  as  a test 
fails.  Our  ideas  are  certainly  valuable  in  this  sense  because  all  the 
results  of  the  rich,  probability  calculus  are  available.  And  here  we 
have  a test:  which  ideas  are  better,  those  based  on  the  likelihood 
principle,  or  those  on  conventional  statistics?  There  is  no  room  here 
to  explore  this  question  generally,  so  we  confine  ourselves  to  an 
example. 

3. Inference for a Galton-Watson Process


Let  us  explore  the  likelihood  principle  in  a situation  that  is  of 
current  research  interest.  Consider  a Galton-Watson  process  in  which 
at  each  generation,  each  individual  alive  then  gives  rise,  independently 
of  all  other  individuals,  to  a number  r of  offspring  which  constitute 
the next generation. Suppose that the probability density for r is
the same for all individuals and is known up to an unknown parameter
$\theta$; write it $p_r(r|\theta)$. For simplicity suppose r is never zero so that
extinction is not possible: the conclusion is unaffected if this
restriction is removed. Since our object is to illustrate the differences
between the likelihood approach and sampling theory ideas, and not to
carry out detailed calculations, let us specialize to the case of the
geometric distribution with $p_r(r|\theta) = (1-\theta)\theta^{r-1}$ for $r \ge 1$ and
$0 < \theta < 1$.


Now suppose the process is observed for N generations starting
with a single individual in the first generation. Let the numbers of
offspring observed be $r_{11}$ from that first generation individual,
$r_{21}, r_{22}, \dots, r_{2r_{11}}$ from the $r_{11}$ at the second generation; and so
on up to $r_{N1}, r_{N2}, \dots, r_{NS}$ where S is the total at the $(N-1)$th
generation. The probability for this data set
$x = (r_{11};\ r_{21}, r_{22}, \dots, r_{2r_{11}};\ r_{31}, \dots)$ is clearly

$(1-\theta)\theta^{r_{11}-1} \cdot (1-\theta)\theta^{r_{21}-1}\, (1-\theta)\theta^{r_{22}-1} \cdots (1-\theta)\theta^{r_{2r_{11}}-1} \cdot (1-\theta)\theta^{r_{31}-1} \cdots (1-\theta)\theta^{r_{NS}-1}$

or simply $(1-\theta)^{R_{N-1}}\, \theta^{R_N - R_{N-1} - 1}$ where $R_i$ is the total number of
individuals up to and including the $i$th generation. Essentially each
parent contributes a term $(1-\theta)/\theta$; each offspring a term $\theta$. The total
number of parents is $R_{N-1}$ (the parenthood of those at the $N$th generation
is not observed) and of offspring $R_N - 1$, since the original individual
was not observed as an offspring. This is the likelihood function, and
according to the likelihood principle no other aspect of the data is
relevant to the uncertainty about $\theta$ and $(R_N, R_{N-1})$ is a sufficient statistic.
(Notice it does not involve N, the number of generations.)
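
The reduction to the two totals can be checked numerically. The following sketch simulates a process with geometric offspring counts and compares the log-likelihood computed term by term with the one computed from the number of parents and the number of offspring alone. The function names, the random seed and the parameter values are hypothetical choices made for the illustration.

```python
import math
import random

def sample_offspring(theta, rng):
    """Draw r >= 1 with probability (1 - theta) * theta**(r - 1)."""
    r = 1
    while rng.random() < theta:
        r += 1
    return r

def simulate(theta, n_generations, rng):
    """Offspring counts of every parent in generations 1, ..., n_generations - 1."""
    counts, size = [], 1                 # a single individual in the first generation
    for _ in range(n_generations - 1):
        generation = [sample_offspring(theta, rng) for _ in range(size)]
        counts.append(generation)
        size = sum(generation)           # size of the next generation
    return counts

def sufficient_statistic(counts):
    """Return (number of parents, number of offspring), i.e. (R_{N-1}, R_N - 1)."""
    parents = sum(len(generation) for generation in counts)
    offspring = sum(sum(generation) for generation in counts)
    return parents, offspring

def loglik_full(theta, counts):
    """Log-likelihood as a sum over every observed offspring count."""
    return sum(math.log(1 - theta) + (r - 1) * math.log(theta)
               for generation in counts for r in generation)

def loglik_sufficient(theta, parents, offspring):
    """Log-likelihood from the two totals only."""
    return parents * math.log(1 - theta) + (offspring - parents) * math.log(theta)

rng = random.Random(0)                   # fixed seed so the run is reproducible
counts = simulate(0.4, 6, rng)
parents, offspring = sufficient_statistic(counts)
```

For any value of $\theta$ the two log-likelihood functions agree up to rounding error, and neither needs to know how many generations produced the totals.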

Another, more interesting, feature of the likelihood is that it is
the same as that for a random sample of size $R_{N-1}$ from the geometric
distribution that yields a sum $R_N - 1$. To see this, note that a random
sample of size n with values $r_1, r_2, \dots, r_n$ gives a likelihood
$(1-\theta)^n\, \theta^{\sum (r_i - 1)}$: writing $n = R_{N-1}$ and $\sum r_i = R_N - 1$ gives the result.

Consequently we have two data sets ($x_1$ and $x_2$ in an earlier notation)
which have the same likelihood function and therefore for which the
uncertainties about $\theta$ should be the same according to the principle: one
data  set  observes  for  a fixed  number  N of  generations,  the  other  observes 
for  a fixed  sample  size  n.  Now  the  latter  situation  is  standard  in  the 
statistical literature and, for example, the maximum likelihood estimate
of $\theta$, $\sum (r_i - 1)/\sum r_i$, is asymptotically normal with mean equal to the true
value of $\theta$ and variance $\theta(1-\theta)^2/n$, the latter being obtained from the
inverse  of  the  expectation  of  minus  the  second  derivative  of  the  log- 
likelihood.  According  to  the  likelihood  principle  the  same  asymptotic 
inference  as  stated  in  the  last  sentence  should  be  available  for  the 
original  case  of  fixed  generation  number.  Is  this  so? 
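
For the fixed sample size n the estimate and its asymptotic variance follow from a short standard calculation, sketched here. The symbol $\ell$ for the log-likelihood is notation introduced for this sketch; the only fact used beyond the text is that the geometric distribution above has mean $E(r_i) = 1/(1-\theta)$, so that $E(\sum r_i - n) = n\theta/(1-\theta)$.

```latex
\begin{align*}
\ell(\theta) &= n\log(1-\theta) + \Bigl(\sum r_i - n\Bigr)\log\theta, \\
\ell'(\theta) &= -\frac{n}{1-\theta} + \frac{\sum r_i - n}{\theta} = 0
  \quad\Longrightarrow\quad
  \hat\theta = \frac{\sum (r_i - 1)}{\sum r_i}, \\
-\ell''(\theta) &= \frac{n}{(1-\theta)^2} + \frac{\sum r_i - n}{\theta^2}, \\
E\{-\ell''(\theta)\} &= \frac{n}{(1-\theta)^2} + \frac{n}{\theta(1-\theta)}
  = \frac{n}{\theta(1-\theta)^2}.
\end{align*}
```

Inverting the last line gives the variance $\theta(1-\theta)^2/n$ quoted for the random sample of fixed size.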

The maximum likelihood estimate remains as before; in the changed
notation it is $(R_N - R_{N-1} - 1)/(R_N - 1)$; but the variance is altered because
the expectation, which previously treated n as fixed and only $\sum r_i$ as
random, now has both these quantities, involving $R_N$ and $R_{N-1}$, as random.
Detailed calculation shows that the expectation of minus the second
derivative of the log-likelihood when inverted gives
$(1-\theta)^N \theta^2/\{1 - (1-\theta)^{N-1}\}$, or asymptotically $(1-\theta)^N \theta^2$, of quite
different form from the earlier expression $\theta(1-\theta)^2/n$. Consequently
the likelihood principle is in
direct  conflict  with  conventional  statistical  practice  that  uses  a 
sampling  variance  to  judge  the  imprecision  of  an  estimate.  The  reason 
is not hard to see. The computation of the variance involves an
integration of $p_x(x|\theta)$ over a suitable space which is different according to
whether  the  generation  number  N or  the  number  n of  parents  is  fixed. 
The  reader  can  judge  for  himself  which  seems  sensible,  remembering  that 
if  he  favours  the  sampling-variance  approach  he  is  somewhere  violating 
one  of  the  axioms  described  above.  For  my  part,  the  likelihood  approach 
seems clearly correct: we have here $R_{N-1}$ parents independently
producing offspring, why should it matter that $R_{N-1}$ is random or fixed?
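
The sampling variance for a fixed number N of generations can be reconstructed as follows. This is a sketch and not the paper's own detailed calculation: it introduces the notation $Z_k$ (not used in the text) for the number of individuals in generation k, so that $Z_1 = 1$ and $R_k = Z_1 + \dots + Z_k$, and it uses the standard branching-process fact $E(Z_k) = m^{k-1}$ with $m = E(r) = 1/(1-\theta)$, whence $m - 1 = \theta/(1-\theta)$.

```latex
\begin{align*}
-\frac{\partial^2}{\partial\theta^2}\log p_x(x|\theta)
  &= \frac{R_{N-1}}{(1-\theta)^2} + \frac{R_N - R_{N-1} - 1}{\theta^2}, \\
E(R_{N-1}) &= \frac{m^{N-1}-1}{m-1} = \frac{(1-\theta)\,(m^{N-1}-1)}{\theta},
  \qquad
  E(R_N - R_{N-1} - 1) = E(Z_N) - 1 = m^{N-1} - 1, \\
E\Bigl\{-\frac{\partial^2}{\partial\theta^2}\log p_x(x|\theta)\Bigr\}
  &= (m^{N-1}-1)\Bigl\{\frac{1}{\theta(1-\theta)} + \frac{1}{\theta^2}\Bigr\}
   = \frac{m^{N-1}-1}{\theta^2(1-\theta)}, \\
\Bigl[E\Bigl\{-\frac{\partial^2}{\partial\theta^2}\log p_x(x|\theta)\Bigr\}\Bigr]^{-1}
  &= \frac{\theta^2(1-\theta)}{(1-\theta)^{-(N-1)} - 1}
   = \frac{(1-\theta)^N\,\theta^2}{1 - (1-\theta)^{N-1}} .
\end{align*}
```

For large N the denominator tends to 1, leaving $(1-\theta)^N \theta^2$. The expectation is taken over both the number of parents and the number of offspring, which is why the result differs in form from the fixed-n expression.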

Another  related  difficulty  with  the  sampling-theoretic  approach  is  that 
in  many  cases  it  is  hard  to  be  sure  what  the  relevant  space  for  integra- 
tion is.  For  example,  suppose  the  N generations  had  been  observed  but 
that some parents had left the system during the period of observation
so that there was no knowledge about their offspring. The likelihood
remains $(1-\theta)^P \theta^{Q-P}$ where P and Q are respectively the number of
parents and the number of offspring. In the original case of complete
observations  it  is  particularly  curious  to  fix  N since  it  is  not  even 
part  of  the  sufficient  statistic. 

In this paper we have shown that it is possible to present statistics
within  the  framework  of  a simple  axiom-system  capturing  the  concept  of 
uncertainty;  that  this  system  leads  to  the  result  that  uncertainty  must 
be  described  probabilistically;  that  one  probability  result,  Bayes  theorem, 
leads  to  the  likelihood  principle,  and  that  this  principle  is  in  direct 
conflict  with  statistical  practice.  We  therefore  have  a piece  of  theory 
that  cannot  be  ignored  by  the  most  applied  of  statisticians  because  it 
strongly  affects  practice. 